ABSTRACT

The information capacity of Kanerva's Sparse, Distributed Memory (SDM) and Hopfield-type
neural networks is investigated. Under the approximations used here, it is shown that the total
information stored in these systems is proportional to the number of connections in the network.
The proportionality constant is the same for the SDM and Hopfield-type models, independent of
the particular model or the order of the model. The approximations are checked numerically.
This same analysis can be used to show that the SDM can store sequences of spatiotemporal
patterns, and the addition of time-delayed connections allows the retrieval of context-dependent
temporal patterns. A minor modification of the SDM can be used to store correlated patterns.


INTRODUCTION 

Many different models of memory and thought have been proposed by scientists over the
years. In 1943, McCulloch and Pitts proposed a simple model neuron with two states of activity
(on and off) and a large number of inputs. [1] Hebb (1949) considered a network of such neurons and
postulated mechanisms for changing synaptic strengths [2] to learn memories. The learning rule
considered here uses the outer-product of patterns of +1s and -1s. Anderson (1977) discussed the
effect of iterative feedback in such a system. [3] Hopfield (1982) showed that for symmetric
connections, [4] the dynamics of such a network is governed by an energy function that is analogous
to the energy function of a spin glass. [5] Numerous investigations have been carried out on similar
models. [6-8]

Several limitations of these binary interaction, outer-product models have been pointed out. 
For example, the number of patterns that can be stored in the system (its capacity) is limited to a 
fraction of the length of the pattern vectors. Also, these models are not very successful at storing 
correlated patterns or temporal sequences. 

Other models have been proposed to overcome these limitations. For example, one can
allow higher-order interactions among the neurons. [9,10] In the following, I focus on a model
developed by Kanerva (1984) called the Sparse, Distributed Memory (SDM) model. The SDM
can be viewed as a three-layer network that uses outer-product learning between the second and
third layers. As discussed below, the SDM is more versatile than the above-mentioned networks
because the number of stored patterns can be increased independent of the length of the pattern,
and the SDM can be used to store spatiotemporal patterns with context retrieval and to store
correlated patterns.

The capacity limitations of outer-product models can be alleviated by using higher-order
interaction models or the SDM, but a price must be paid for this added capacity in terms of an
increase in the number of connections. How much information is gained per connection? It is
shown in the following that the total information stored in each system is proportional to the
number of connections in the network, and that the proportionality constant is independent of the
particular model or the order of the model. This result also holds if the connections are limited to
one bit of precision (clipped weights). The analysis presented here requires certain simplifying
assumptions. The approximate results are compared numerically to an exact calculation developed
by Chou. [12]


SIMPLE OUTER-PRODUCT NEURAL NETWORK MODEL 

As an example of a simple first-order neural network model, I consider in detail the model
developed by Hopfield. [4] This model will be used to introduce the mathematics and the concepts
that will be generalized for the analysis of the SDM. The “neurons” are simple two-state
threshold devices: the state of the $i$th neuron, $u_i$, is either +1 (on) or -1 (off). Consider a
set of $n$ such neurons with net input (local field), $h_i$, to the $i$th neuron given by

$$h_i = \sum_j T_{ij} u_j, \tag{1}$$

where $T_{ij}$ represents the interaction strength between the $i$th neuron and the $j$th. The state of each
neuron is updated asynchronously (at random) according to the rule

$$u_i \leftarrow g(h_i), \tag{2}$$

where the function $g$ is a simple threshold function, $g(x) = \mathrm{sign}(x)$.

Suppose we are given $M$ randomly chosen patterns (strings of length $n$ of ±1s) which we
wish to store in this system. Denote these $M$ memory patterns as pattern vectors:
$\mathbf{p}^\alpha = (p_1^\alpha, p_2^\alpha, \ldots, p_n^\alpha)$, $\alpha = 1, 2, 3, \ldots, M$. For example, $\mathbf{p}^1$ might look like
$(+1, -1, +1, -1, -1, \ldots, +1)$. One method of storing these patterns is the outer-product (Hebbian)
learning rule: start with $T = 0$, and accumulate the outer-products of the pattern vectors. The resulting
connection matrix is given by

$$T_{ij} = \sum_{\alpha=1}^{M} p_i^\alpha p_j^\alpha, \qquad T_{ii} = 0. \tag{3}$$


The system described above is a dynamical system with attracting fixed points. To obtain
an approximate upper bound on the total information stored in this network, we sidestep the issue
of the basins of attraction, and we check to see if each of the patterns stored by Eq. (3) is actually
a fixed point of (2). Suppose we are given one of the patterns, $\mathbf{p}^\beta$, say, as the initial configuration
of the neurons. I will show that $\mathbf{p}^\beta$ is expected to be a fixed point of Eq. (2). After inserting (3)
for $T$ into (1), the net input to the $i$th neuron becomes

$$h_i = \sum_{\alpha=1}^{M} p_i^\alpha \Big[ \sum_{j \neq i} p_j^\alpha p_j^\beta \Big]. \tag{4}$$

The important term in the sum on $\alpha$ is the one for which $\alpha = \beta$. This term represents the
“signal” between the input $\mathbf{p}^\beta$ and the desired output. The rest of the sum represents “noise”
resulting from crosstalk with all of the other stored patterns. The expression for the net input becomes
$h_i = \mathrm{signal}_i + \mathrm{noise}_i$, where

$$\mathrm{signal}_i = p_i^\beta \Big[ \sum_{j \neq i} p_j^\beta p_j^\beta \Big], \tag{5}$$

$$\mathrm{noise}_i = \sum_{\alpha \neq \beta} p_i^\alpha \Big[ \sum_{j \neq i} p_j^\alpha p_j^\beta \Big]. \tag{6}$$


Summing on all of the $j$ in (5) yields $\mathrm{signal}_i = (n-1)\,p_i^\beta$. Since $n-1$ is positive, the sign of
the signal term and $p_i^\beta$ will be the same. Thus, if the noise term were exactly zero, the signal
would give the same sign as $p_i^\beta$ with a magnitude of $\approx n$, and $\mathbf{p}^\beta$ would be a fixed point of (2).
Moreover, patterns close to $\mathbf{p}^\beta$ would give nearly the same signal, so that $\mathbf{p}^\beta$ should be an
attracting fixed point.
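The signal and noise decomposition of Eqs. (4) to (6) can be checked numerically. The following is a minimal sketch, assuming NumPy and illustrative sizes (`n = 200`, `M = 5`) that are not taken from the text; the variable names are likewise only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 200, 5                      # illustrative sizes (assumed, not from the text)

p = rng.choice([-1, 1], size=(M, n))   # row alpha holds pattern p^alpha

T = p.T @ p                        # Eq. (3): sum of outer products
np.fill_diagonal(T, 0)             # enforce T_ii = 0

beta = 0
h = T @ p[beta]                    # Eq. (1) with u = p^beta, i.e. Eq. (4)

signal = (n - 1) * p[beta]         # Eq. (5) summed over j != i

others = np.delete(p, beta, axis=0)                # patterns with alpha != beta
overlaps = others @ p[beta]                        # sum over all j of p_j^alpha p_j^beta
noise = others.T @ overlaps - (M - 1) * p[beta]    # Eq. (6): drop the j = i terms

recalled = np.sign(h)              # one synchronous application of Eq. (2)
```

With so few patterns relative to `n`, the noise is small compared with the signal, so `recalled` reproduces `p[beta]`; raising `M` toward a sizeable fraction of `n` makes recall errors appear.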

For randomly chosen patterns, $\langle \mathrm{noise} \rangle = 0$, where $\langle\ \rangle$ indicates statistical expectation, and
its variance will be $\sigma^2 = (n-1)(M-1)$. The probability that there will be an error on recall of $p_i^\beta$
is given by the probability that the noise is greater than the signal. For $n$ large, the noise
distribution is approximately Gaussian, and the probability that there is an error in the $i$th bit is

$$p_e = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{|\mathrm{signal}|}^{\infty} e^{-x^2/2\sigma^2}\, dx. \tag{7}$$
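As a hedged illustration of Eq. (7), the Gaussian tail can be evaluated with the complementary error function from the Python standard library. The function name `hopfield_bit_error` is hypothetical, and the sketch assumes the signal magnitude `n - 1` and the variance `(n - 1)(M - 1)` stated above, with at least two stored patterns.

```python
import math

def hopfield_bit_error(n: int, M: int) -> float:
    """Approximate per-bit recall error, Eq. (7), for M >= 2 stored patterns."""
    signal = n - 1                           # magnitude of the signal term
    sigma = math.sqrt((n - 1) * (M - 1))     # standard deviation of the noise
    R = signal / sigma                       # fidelity (signal-to-noise ratio)
    # Upper tail of a zero-mean Gaussian beyond |signal|
    return 0.5 * math.erfc(R / math.sqrt(2.0))

p_few = hopfield_bit_error(n=1000, M=101)    # R^2 is about 10
p_many = hopfield_bit_error(n=1000, M=301)   # more patterns, more crosstalk
```

Storing more patterns at fixed `n` lowers the fidelity, so `p_many` is larger than `p_few`.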


INFORMATION CAPACITY 

The number of patterns that can be stored in the network is known as its capacity. [13,14]
However, for a fair comparison between all of the models discussed here, it is more relevant to
compare the total number of bits (total information) stored in each model rather than the number of
patterns. This allows comparison of information storage in models with different lengths of the
pattern vectors. If we view the memory model as a black box which receives input bit strings and
outputs them with some small probability of error in each bit, then the definition of bit-capacity
used here is exactly the definition of channel capacity used by Shannon. [15]

Define the bit-capacity as the number of bits that can be stored in a network with fixed
probability of getting an error in a recalled bit, i.e., $p_e$ = constant in (7). Explicitly, the bit-capacity
is given by [16]

$$B = \text{bit capacity} = nM\eta, \tag{8}$$

where $\eta = 1 + p_e \log_2 p_e + (1-p_e)\log_2(1-p_e)$. Note that $\eta = 1$ for $p_e = 0$. Setting $p_e$ to a constant is
tantamount to keeping the signal-to-noise ratio (fidelity) constant, where the fidelity, $R$, is given by
$R = |\mathrm{signal}|/\sigma$. Explicitly, the relation between (constant) $p_e$ and $R$ is just $R = \Phi^{-1}(1-p_e)$,
where

$$\Phi(R) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{R} e^{-t^2/2}\, dt. \tag{9}$$

Hence, the bit-capacity of these networks can be investigated by examining the fidelity of the
models as a function of $n$, $M$, and $R$. From (5) and (6), the fidelity of the Hopfield model is
$R^2 = n^2/(n(M-1))$ for $n \gg 1$. Solving for $M$ in terms of (fixed) $R$ and $\eta$, the bit-capacity becomes
$B = \eta[(n^2/R^2) + n]$.

The results above can be generalized to models with $d$th-order interactions. [17,18] The resulting
expression for the bit-capacity for $d$th-order interaction models is just

$$B = \eta\Big[\frac{n^{d+1}}{R^2} + n\Big]. \tag{10}$$
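Equations (8) and (10) are simple enough to evaluate directly. The sketch below is an illustration only; the function names `eta` and `bit_capacity` are hypothetical, and the boundary values of `eta` at `p_e = 0` and `p_e = 1` are set to the limit of the expression.

```python
import math

def eta(p_e: float) -> float:
    """Information per recalled bit, from Eq. (8): 1 minus the binary entropy of p_e."""
    if p_e <= 0.0 or p_e >= 1.0:
        return 1.0                       # limiting value of the expression
    return 1.0 + p_e * math.log2(p_e) + (1.0 - p_e) * math.log2(1.0 - p_e)

def bit_capacity(n: int, d: int, R: float, p_e: float) -> float:
    """Bit-capacity of a d-th order outer-product model, Eq. (10)."""
    return eta(p_e) * (n ** (d + 1) / R ** 2 + n)

B_first = bit_capacity(n=100, d=1, R=3.1, p_e=0.0)    # Hopfield-type (d = 1)
B_second = bit_capacity(n=100, d=2, R=3.1, p_e=0.0)   # second-order interactions
```

At fixed `n` and `R`, raising the order `d` multiplies the leading term by `n`, which is the growth in stored bits that the text goes on to weigh against the growth in connections.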

Hence, we see that the number of bits stored in the system increases with the order $d$. However,
to store these bits, one must pay a price by including more connections in the connection tensor.
To demonstrate the relationship between the number of connections and the information stored,
define the information capacity, $\gamma$, to be the total information stored in the network divided by the
number of bits in the connection tensor (note that this is different from the definition used by
Abu-Mostafa et al.). [19] Thus $\gamma$ is just the bit-capacity divided by the number of bits in the tensor
and represents the efficiency with which information is stored in the network. Since $T$ has $n^{d+1}$
elements, the information capacity is found to be

$$\gamma = \frac{\eta\,[\,n^{d+1}/R^2 + n\,]}{b\, n^{d+1}}, \tag{11}$$

where $b$ is the number of bits of precision used per tensor element ($b \ge \log_2 M$ for no clipping of
the weights). For large $n$, the information stored per neuronal connection is $\gamma = \eta/R^2 b$,
independent of the order of the model (compare this result to that of Peretto et al.). To illustrate this
point, suppose one decides that the maximum allowed probability of getting an error in a recalled
bit is $p_e = 1/1000$; this fixes the minimum value of $R$ at 3.1. Thus, to store 10,000 bits
with a probability of getting an error in a recalled bit of 0.001, equation (11) states that it would
take $\approx 96{,}000\,b$ bits, independent of the order of the model, or $\approx 0.1n$ patterns can be stored with
probability 1/1000 of getting an error in a recalled bit.
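The worked numbers above can be reproduced with a short calculation. This is a sketch under the large-n approximation, where the information per connection is η/(R²b); it assumes Python's `statistics.NormalDist` for the inverse of Φ, and the variable names are illustrative.

```python
from math import log2
from statistics import NormalDist

p_e = 1e-3                                   # allowed error probability per recalled bit
R = NormalDist().inv_cdf(1.0 - p_e)          # fidelity needed, about 3.09

eta = 1.0 + p_e * log2(p_e) + (1.0 - p_e) * log2(1.0 - p_e)

gamma_times_b = eta / R ** 2                 # information per connection, times b
bits_per_b = 10_000 / gamma_times_b          # connection bits needed, in units of b
patterns_per_n = 1.0 / R ** 2                # M/n for the first-order model, n large
```

Here `bits_per_b` comes out a little under 97,000 and `patterns_per_n` close to 0.1, in line with the rounded figures quoted in the text.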

KANERVA’S SDM 

Now we focus our attention on Kanerva's Sparse, Distributed Memory model (SDM). [11] The
SDM can be viewed as a three-layer network with the middle layer playing the role of hidden units.
To get an autoassociative network, the output layer can be fed back into the input layer, effectively
making this a two-layer network. The first layer of the SDM is a layer of $n$ ±1 input units (the
input address, $\mathbf{a}$), the middle layer is a layer of $m$ hidden units, $\mathbf{s}$, and the third layer consists of
the $n$ ±1 output units (the data, $\mathbf{d}$). The connections between the input units and the hidden units
are random weights of ±1 and are given by the $m \times n$ matrix $A$. The connections between the
hidden units and the output units are given by the $n \times m$ connection matrix $C$, and these matrix
elements are modified by an outer-product learning rule ($C$ is analogous to the matrix $T$ of the
Hopfield model).


Given an input pattern $\mathbf{a}$, the hidden unit activations are determined by

$$\mathbf{s} = \theta_r(A\mathbf{a}), \tag{12}$$

where $\theta_r$ is the Hamming-distance threshold function: the $k$th element is 1 if the input $\mathbf{a}$ is at
most $r$ Hamming units away from the $k$th row in $A$, and 0 if it is further than $r$ units away, i.e.,

$$\theta_r(\mathbf{x})_k = \begin{cases} 1 & \text{if } \tfrac{1}{2}(n - x_k) \le r, \\ 0 & \text{if } \tfrac{1}{2}(n - x_k) > r. \end{cases} \tag{13}$$


The hidden-units vector, or select vector, $\mathbf{s}$, is mostly 0s with an average of $\delta m$ 1s, where $\delta$ is
some small number dependent on $r$; $\delta \ll 1$. Hence, $\mathbf{s}$ represents a large, sparsely coded vector of 0s
and 1s representing the input address. The net input, $\mathbf{h}$, to the final layer can be simply expressed
as the product of $C$ with $\mathbf{s}$:

$$\mathbf{h} = C\mathbf{s}. \tag{14}$$

Finally, the output data is given by $\mathbf{d} = g(\mathbf{h})$, where $g_i(h_i) = \mathrm{sign}(h_i)$.

To store the $M$ patterns, $\mathbf{p}^1, \mathbf{p}^2, \ldots, \mathbf{p}^M$, form the outer-product of these pattern vectors and
their corresponding select vectors,

$$C = \sum_{\alpha=1}^{M} \mathbf{p}^\alpha (\mathbf{s}^\alpha)^T, \tag{15}$$

where $T$ denotes the transpose of the vector, and where each select vector is formed by the
corresponding address, $\mathbf{s}^\alpha = \theta_r(A\mathbf{p}^\alpha)$. The storage algorithm (15) is an outer-product learning rule
similar to (3).
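The select, write, and read steps of Eqs. (12) to (15) can be sketched in a few lines. This is a minimal illustration assuming NumPy; the class name `SDM`, the sizes `n = 64`, `m = 2000`, and the radius `r = 23` are hypothetical choices for the example, picked so that only a small fraction of the hidden units is selected.

```python
import numpy as np

class SDM:
    """Minimal sparse distributed memory sketch (illustrative, not Kanerva's code)."""

    def __init__(self, n, m, r, rng):
        self.n, self.r = n, r
        self.A = rng.choice([-1, 1], size=(m, n))    # fixed random address matrix
        self.C = np.zeros((n, m), dtype=np.int64)    # learned hidden-to-output weights

    def select(self, a):
        # Eqs. (12)-(13): A @ a = n - 2 * (Hamming distance) for +/-1 vectors
        hamming = (self.n - self.A @ a) // 2
        return (hamming <= self.r).astype(np.int64)

    def write(self, address, data):
        # Eq. (15): accumulate the outer product of data and select vector
        self.C += np.outer(data, self.select(address))

    def read(self, address):
        # Eq. (14) followed by the sign threshold
        return np.sign(self.C @ self.select(address))

rng = np.random.default_rng(1)
n, m, r, M = 64, 2000, 23, 5
memory = SDM(n, m, r, rng)
patterns = rng.choice([-1, 1], size=(M, n))
for p in patterns:
    memory.write(p, p)               # autoassociative storage: address = data

recalled = memory.read(patterns[0])
```

With these sizes roughly two percent of the hidden units are selected per address, so the crosstalk from the other four patterns is far smaller than the signal and `recalled` matches `patterns[0]`.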

Suppose that the $M$ patterns $(\mathbf{p}^1, \mathbf{p}^2, \ldots, \mathbf{p}^M)$ have been stored according to (15). Following
the analysis presented for the Hopfield model, I show that if the system is presented with $\mathbf{p}^\beta$ as
input, the output will be $\mathbf{p}^\beta$ (i.e., $\mathbf{p}^\beta$ is a fixed point). Setting $\mathbf{a} = \mathbf{p}^\beta$ in (12) and separating terms
as before, the net input (14) becomes

$$\mathbf{h} = \mathbf{p}^\beta (\mathbf{s}^\beta \cdot \mathbf{s}^\beta) + \sum_{\alpha \neq \beta} \mathbf{p}^\alpha (\mathbf{s}^\alpha \cdot \mathbf{s}^\beta), \tag{16}$$

where the first term represents the signal and the second is the noise. Recall that the select vectors
have an average of $\delta m$ 1s and the remainder 0s, so that the expected value of the signal is $\delta m\, \mathbf{p}^\beta$.

Assuming that the addresses and data are randomly chosen, the expected value of the noise
is zero. To evaluate the fidelity, I make certain approximations. First, I assume that the select
vectors are independent of each other. Second, I assume that the variance of the signal alone is zero
or small compared to the variance of the noise term alone. The first assumption will be valid for
$m\delta^2 \ll 1$, and the second assumption will be valid for $M\delta \gg 1$. With these assumptions, we can
easily calculate the variance of the noise term, because the select vectors are i.i.d. vectors
of length $m$ with mostly 0s and $\approx \delta m$ 1s. The fidelity is then given by

$$R^2 = \frac{m}{(M-1)\,[\,1 + \delta^2 m\,(1 - 1/m)\,]}. \tag{17}$$
In the limit of large $m$, with $\delta m$ = constant, the number of stored bits scales as

$$B = \eta\Big[\frac{mn}{R^2\,(1 + \delta^2 m)} + n\Big]. \tag{18}$$


If we divide this by the number of elements in $C$, we find the information capacity, $\gamma = \eta/R^2 b$,
just as before, so the information capacity is the same for the two models. (If we divide the bit
capacity by the number of elements in $C$ and $A$, then we get $\gamma = \eta/R^2(b+1)$, which is about the
same for large $M$.)
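As a quick consistency check between the fidelity expression (17) and the bit-capacity scaling (18), the two can be evaluated side by side. This sketch is illustrative only: the function names are hypothetical, `delta = 0.02` is an assumed selection fraction, and the other values are those quoted for Figure 1.

```python
def sdm_fidelity_sq(m: int, M: int, delta: float) -> float:
    """R^2 from Eq. (17); note delta^2 * m * (1 - 1/m) = delta^2 * (m - 1)."""
    return m / ((M - 1) * (1.0 + delta ** 2 * (m - 1)))

def sdm_bit_capacity(n: int, m: int, R_sq: float, delta: float, eta: float = 1.0) -> float:
    """Bit-capacity from Eq. (18), the large-m form."""
    return eta * (m * n / (R_sq * (1.0 + delta ** 2 * m)) + n)

n, m, M, delta = 150, 2000, 100, 0.02
R_sq = sdm_fidelity_sq(m, M, delta)
B = sdm_bit_capacity(n, m, R_sq, delta)      # should be close to n * M when eta = 1
```

Because (18) is obtained by solving (17) for `M` and inserting it into `B = n M eta`, the value `B / n` recovers `M` up to the small difference between `m` and `m - 1`.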


A few comments before we continue. First, it should be pointed out that the assumption
made by Kanerva [11] and Keeler [17,18] that the variance of the signal term is much less than that of
the noise is not valid over the entire range. If we took this into account, then the magnitude of the
denominator would be increased by the variance of the signal term. Further, if we read at a
distance $l$ away from the write address, then it is easy to see that the signal changes to be $m\delta(l)$,
where $\delta(l)$ is the overlap of two spheres of radius $r$ a distance $l$ apart in the $n$-dimensional
binomial space ($\delta = \delta(0)$). The fidelity for reading at a distance $l$ away from the write address is

$$R^2 = \frac{m^2\delta^2(l)}{m\delta(l)\,(1-\delta(l)) + (M-1)\,m\,\delta^2 + (M-1)\,\delta^4 m^2\,(1 - 1/m)}. \tag{19}$$

Compare this to the formula derived by Chou [12] for the exact signal-to-noise ratio:

$$R^2 = \frac{m^2\delta^2(l)}{m\delta(l)\,(1-\delta(l)) + (M-1)\,m\,\mu_{n,r} + (M-1)\,\sigma^2_{n,r}\, m^2\,(1 - 1/m)}, \tag{20}$$

where $\mu_{n,r}$ is the average overlap of the spheres of radius $r$ binomially distributed with parameters
$(n, 1/2)$ and $\sigma^2_{n,r}$ is the square of this overlap. The difference in these two formulas lies in the
denominator, in the terms $\delta^2$ versus $\mu_{n,r}$ and $\delta^4$ versus $\sigma^2_{n,r}$. The difference comes from the fact that
Chou correctly calculates the overlap of the spheres without using the independence assumption.

How do these formulas differ? First of all, it is found numerically that $\delta^2$ is identical with
$\mu_{n,r}$. Hence, the only difference comes from $\delta^4$ versus $\sigma^2_{n,r}$. For $m\delta^2 < 1$, the $\delta^4$ term is
negligible compared to the other terms in the denominator. In addition, $\delta^4$ and $\sigma^2_{n,r}$ are approximately
equal for large $n$ and $r \approx n/2$. Hence, in the limit $n \to \infty$ the two formulas agree over most of the
range if $M \approx 0.1m$, $m < 2^n$. However, for finite $n$, the two formulas can disagree when $m\delta^2 \approx 1$ (see
Figure 1).



Figure 1: A comparison of the fidelity calculations of the SDM for typical n, M, and m
values. Equation (17) was derived assuming no variance of the signal term, and is shown
by the + line. Equation (19) uses the approximation that all of the select vectors are
independent, denoted by the o line. Equation (20) (*'s) is the exact derivation done by
Chou. The values used here were n = 150, m = 2000, M = 100.


Equation (20) suggests that there is a best read-write Hamming radius for the SDM. By
setting $l = 0$ in (19) and by setting $dR^2/d\delta = 0$, we get an approximate expression for the best
Hamming radius: $\delta_{best} = (2Mm)^{-1/3}$. This trend is qualitatively shown in Figure 2.


Figure 2: Numerical investigation of the capacity of the SDM. The vertical axis is the
percent of recovered patterns with no errors. The x-axis (left to right) is the Hamming
distance used for reading and writing. The y-axis (back to forward) is the number of patterns
that were written into the memory. For this investigation, n = 128, m = 1024, and M
ranges from 1 to 501. Note the similarity of a cross-section of this graph at constant M
with Figure 1. This calculation was performed by David Cohn at RIACS, NASA-Ames.


Figure 1 indicates that the formula (17) that neglected the variance of the signal term is
incorrect over much of the range. However, a variant of the SDM is to constrain the number of
selected locations to be constant; circuitry for doing this is easily built. [21] The variance of the
signal term would be zero in that case, and the approximate expression for the fidelity is given by Eq.
(17). There are certain problems where it would be better to keep $\delta$ = constant, as in the case of
correlated patterns (see below).

The above analysis was done assuming that the elements (weights) in the outer-product
matrix are not clipped, i.e., that there are enough bits to store the largest value of any matrix
element. It is interesting to consider what happens if we allow these values to be represented by only
a few bits. If we consider the case $b = 1$, i.e., the weights are clipped at one bit, it is easy
to show [17] that $\gamma \approx 2\eta/\pi R^2$ for the $d$th-order models and for the SDM, which yields $\gamma = 0.07$ for
reasonable $R$ (this is substantially less than Willshaw's 0.69).



SEQUENCES 


In an autoassociative memory, the system relaxes to one of the stored patterns and stays
fixed in time until a new input is presented. However, there are many problems where the recalled
patterns must change sequentially in time. For example, a song can be remembered as a string of
notes played in the correct sequence; cyclic patterns of muscle contractions are essential for
walking, riding a bicycle, or dribbling a basketball. As a first step we consider the very simplistic
sequence production as put forth by Hopfield (1982) and Kanerva (1984).

Suppose that we wished to store a sequence of patterns in the SDM. Let the pattern vectors
be given by $(\mathbf{p}^1, \mathbf{p}^2, \ldots, \mathbf{p}^M)$. This sequence of patterns could be stored by having each pattern
point to the next pattern in the sequence. Thus, for the SDM, the patterns would be stored as
input-output pairs $(\mathbf{a}^\alpha, \mathbf{d}^\alpha)$, where $\mathbf{a}^\alpha = \mathbf{p}^\alpha$ and $\mathbf{d}^\alpha = \mathbf{p}^{\alpha+1}$ for $\alpha = 1, 2, 3, \ldots, M-1$. Convergence to
this sequence works as follows: if the SDM is presented with an address that is close to $\mathbf{p}^1$, the
read data will be close to $\mathbf{p}^2$. Iterating the system with this read data as the new input address, the
next read data will be even closer to $\mathbf{p}^3$. As this iterative process continues, the read data will
converge to the stored sequence, with the next pattern in the sequence being presented at each time step.
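The pointer-chain storage described above can be sketched as follows. This is a minimal illustration assuming NumPy; the sizes, the radius `r = 23`, and the helper names `select` and `recall` are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r, M = 64, 2000, 23, 6            # illustrative sizes (assumed)

A = rng.choice([-1, 1], size=(m, n))    # fixed random address matrix
seq = rng.choice([-1, 1], size=(M, n))  # the sequence p^1, ..., p^M as rows

def select(a):
    """Select vector: 1 where a row of A lies within Hamming distance r of a."""
    return ((n - A @ a) // 2 <= r).astype(np.int64)

# Store each pattern as the address of its successor: (a, d) = (p^alpha, p^(alpha+1))
C = np.zeros((n, m), dtype=np.int64)
for alpha in range(M - 1):
    C += np.outer(seq[alpha + 1], select(seq[alpha]))

def recall(start, steps):
    """Iterate the memory, feeding each read-out back in as the next address."""
    out, a = [], start
    for _ in range(steps):
        a = np.sign(C @ select(a))
        out.append(a)
    return np.array(out)

recalled = recall(seq[0], M - 1)        # expected to step through p^2, ..., p^M
```

Starting from the first pattern, each read returns the next stored pattern, so `recalled` reproduces the remainder of the sequence for these sizes.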

The convergence statistics are essentially the same for sequential patterns as those shown
above for autoassociative patterns. Presented with $\mathbf{p}^\alpha$ as an input address, the signal for the stored
sequence is found as before:

$$\langle \mathrm{signal} \rangle = \delta m\, \mathbf{p}^{\alpha+1}. \tag{21}$$

Thus, given $\mathbf{p}^\alpha$, the read data is expected to be $\mathbf{p}^{\alpha+1}$. Assuming that the patterns in the sequence
are randomly chosen, the mean value of the noise is zero, with variance

$$\langle \sigma^2 \rangle = (M-1)\,\delta^2 m\,(1 + \delta^2 (m-1)). \tag{22}$$

Hence, the length of a sequence that can be stored in the SDM increases linearly with $m$ for large
$m$.

Attempting to store sequences like this in the Hopfield model is not very successful due to
the asynchronous updating used in the Hopfield model. A synchronously updated outer-product
model (for example [6]) would work just as described for the SDM, but it would still be limited to
storing a fraction of the word size as the maximum sequence length.

Another method for storing sequences in Hopfield-like networks has been proposed
independently by Kleinfeld [22] and Sompolinsky and Kanter. [23] These models relieve the problem created by
asynchronous updating by using a time-delayed sequential term. This time-delay storage algorithm
has different dynamics than the synchronous SDM model. In the time-delay algorithm, the system
allows time for the units to relax to the first pattern before proceeding on to the next pattern,
whereas in the synchronous algorithms, the sequence is recalled imprecisely from imprecise input
for the first few iterations and then correctly after that. In other words, convergence to the
sequence takes place “on the fly” in the synchronous models: the system does not wait to zero
in on the first pattern before proceeding on to recover the following patterns. This allows the
synchronous algorithms to proceed k times as fast as the asynchronous time-delay algorithms with
half as many (variable) matrix elements. This difference should be detectable in biological
systems.

TIME DELAYS AND HYSTERESIS: FOLDS 

The above scenario for storing sequences is inadequate to explain speech recognition or
pattern generation. For example, the above algorithm cannot store sequences of the form ABAC, or
overlapping sequences. In Kanerva's original work, he included the concept of time delays as a
general way of storing sequences with hysteresis. The problem addressed by this is the following.
Suppose we wish to store two sequences of patterns that overlap. For example, the two pattern
sequences (a,b,c,d,e,f,...) and (x,y,z,d,w,v,...) overlap at the pattern d. If the system only has
knowledge of the present state, then when given the input d, it cannot decide whether to output w
or e. To store two such sequences, the system must have some knowledge of the immediate past.
Kanerva incorporates this idea into the SDM by using “folds.” A system with $F+1$ folds has a
time history of $F$ past states. These $F$ states may be over the past $F$ time steps or they may go
even further back in time, skipping some time steps. The algorithm for reading from the SDM with
folds becomes

$$\mathbf{d}(t+1) = g\big(C^0 \mathbf{s}(t) + C^1 \mathbf{s}(t-\tau_1) + \cdots + C^F \mathbf{s}(t-\tau_F)\big), \tag{23}$$



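where $\mathbf{s}(t-\tau_\beta) = \theta_r(A\,\mathbf{a}(t-\tau_\beta))$. To store the $Q$ pattern sequences $(\mathbf{p}_1^1, \mathbf{p}_1^2, \ldots, \mathbf{p}_1^{M_1})$,
$(\mathbf{p}_2^1, \mathbf{p}_2^2, \ldots, \mathbf{p}_2^{M_2})$, ..., $(\mathbf{p}_Q^1, \mathbf{p}_Q^2, \ldots, \mathbf{p}_Q^{M_Q})$, construct the matrix of the $\beta$th fold as follows:

$$C^\beta = w_\beta \sum_{\alpha=1}^{Q} \sum_{t=1}^{M_\alpha - 1} \mathbf{p}_\alpha^{t+1} \big(\mathbf{s}_\alpha^{t-\tau_\beta}\big)^T, \tag{24}$$

where any vector with a superscript less than 1 is taken to be zero, $\mathbf{s}_\alpha^{t} = \theta_r(A\,\mathbf{p}_\alpha^{t})$, and $w_\beta$ is a
weighting factor that would normally decrease with increasing $\beta$.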

Why do these folds work? Suppose that the system is presented with the pattern sequence
$(\mathbf{p}_1^1, \mathbf{p}_1^2, \ldots, \mathbf{p}_1^\tau)$, with each pattern presented sequentially as input until the $\tau$th time step. For
simplicity, assume that $w_\beta = 1$ for all $\beta$. Each term in Eq. (23) will contribute a signal similar to
the signal for the single-fold system. Thus, on the $\tau$th time step, the signal term coming from Eq.
(23) is $\langle \mathrm{signal}(t+1) \rangle = F\delta m\, \mathbf{p}_1^{\tau+1}$. The signal will have this value until the end of the pattern
sequence is reached. The mean of the noise terms is zero, with variance
$\langle \mathrm{noise}^2 \rangle = F(M-1)\,\delta^2 m\,(1 + \delta^2(m-1))$. Hence, the signal-to-noise ratio is $\sqrt{F}$ times as strong as it
is for the SDM without folds.

Suppose further that the second stored pattern sequence happens to match the first stored
sequence at $t = \tau$. The signal term would then be

$$\mathrm{signal}(t+1) = F\delta m\, \mathbf{p}_1^{\tau+1} + \delta m\, \mathbf{p}_2^{\tau+1}. \tag{25}$$

With no history of the past ($F = 1$) the signal is split between $\mathbf{p}_1^{\tau+1}$ and $\mathbf{p}_2^{\tau+1}$, and the output is
ambiguous. However, for $F > 1$, the signal for the first pattern sequence dominates and allows
retrieval of the remainder of the correct sequence. This formulation allows context to aid in the
retrieval of stored sequences, and can differentiate between overlapping sequences by using time
delays.
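The overlapping-sequence example (a,b,c,d,e,f) and (x,y,z,d,w,v) can be sketched with one delayed term. This is an illustration under stated assumptions, not Kanerva's implementation: it uses NumPy, a single delay of one time step with unit weights, and hypothetical names (`C0`, `C1`, `select`, `read`) and sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 64, 2000, 23                  # illustrative sizes (assumed)

A = rng.choice([-1, 1], size=(m, n))    # fixed random address matrix

def select(a):
    """Select vector: 1 where a row of A lies within Hamming distance r of a."""
    return ((n - A @ a) // 2 <= r).astype(np.int64)

# Random patterns named as in the text; "d" is shared by both sequences
pat = {name: rng.choice([-1, 1], size=n) for name in "abcdefxyzwv"}
seq1 = [pat[k] for k in "abcdef"]
seq2 = [pat[k] for k in "xyzdwv"]

C0 = np.zeros((n, m), dtype=np.int64)   # addressed by the present state
C1 = np.zeros((n, m), dtype=np.int64)   # addressed by the state one step back
for seq in (seq1, seq2):
    for t in range(len(seq) - 1):
        C0 += np.outer(seq[t + 1], select(seq[t]))
        if t >= 1:                      # no delayed address exists before the start
            C1 += np.outer(seq[t + 1], select(seq[t - 1]))

def read(current, previous):
    """Read with the present state plus one delayed state, in the spirit of Eq. (23)."""
    return np.sign(C0 @ select(current) + C1 @ select(previous))

after_c_d = read(pat["d"], pat["c"])    # context c, d should lead to e
after_z_d = read(pat["d"], pat["z"])    # context z, d should lead to w
```

The present-state term alone votes equally for e and w at the shared pattern d; the delayed term breaks the tie in favor of whichever sequence the context belongs to.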

The above formulation is still too simplistic in terms of being able to do real recognition
problems such as speech recognition. First, the above algorithm can only recall sequences at a
fixed time rate, whereas speech recognition occurs at widely varying rates. Second, the above
algorithm does not allow for deletions in the incoming data. For example, “seqnce” can be
recognized as “sequence” even though some letters are missing. Third, as pointed out by Lashley, [24]
speech processing relies on hierarchical structures.

Although Kanerva's original algorithm is too simplistic, a straightforward modification
allows retrieval at different rates with deletions. To achieve this, we can add on the time-delay
terms with weights which are smeared out in time. Kanerva's (1984) formulation can thus be
viewed as a discrete-time formulation of that put forth by Hopfield and Tank (1987). [25] Explicitly
we could write

$$\mathbf{h} = \sum_{\beta=1}^{F} \sum_{k=-F}^{\beta} W_{\beta k}\, C^\beta\, \mathbf{s}(t - \tau_\beta - k), \tag{26}$$

where the coefficients $W_{\beta k}$ are a discrete approximation to a smooth function which spreads the
delayed signal out over time. As a further step, we could modify these weights dynamically to
optimize the signal coming out. The time-delay patterns could also be placed in a hierarchical
structure as in the matched filter avalanche structure put forth by Grossberg et al. (1986). [26]

CORRELATED PATTERNS 

In the above associative memories, all of the patterns were taken to be randomly chosen,
uniformly distributed binary vectors of length $n$. However, there are many applications where the
set of input patterns is not uniformly distributed; the input patterns are correlated. In mathematical
terms, the set $K$ of input patterns would not be uniformly distributed over the entire space of $2^n$
possible patterns. Let the probability distribution function for the Hamming distance between two
randomly chosen vectors $\mathbf{p}^\alpha$ and $\mathbf{p}^\beta$ from the distribution $K$ be given by the function $\rho(d(\mathbf{p}^\alpha - \mathbf{p}^\beta))$,
where $d(\mathbf{x} - \mathbf{y})$ is the Hamming distance between $\mathbf{x}$ and $\mathbf{y}$.

The SDM can be generalized from Kanerva's original formulation so that correlated input
patterns can be associated with output patterns. For the moment, assume that the distribution set
$K$ and the probability density function $\rho(x)$ are known a priori. Instead of constructing the rows
of the matrix $A$ from the entire space of $2^n$ patterns, construct the rows of $A$ from the distribution
$K$. Adjust the Hamming distance $r$ so that a constant number, $\delta m$, of locations are selected.



In other words, adjust $r$ so that the value of $\delta$ is the same as given above, where $\delta$ is determined
by

$$\delta = \frac{\int_0^r \rho(x)\, dx}{2^n}. \tag{27}$$

This implies that $r$ would have to be adjusted dynamically. This could be done, for example, by a
feedback loop. Circuitry for doing this is easily built, and a similar structure appears in the
Golgi cells in the cerebellum. [27]
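One simple software analogue of adjusting r dynamically is to pick, for each address, the smallest radius that selects at least a target number of locations. The sketch below assumes NumPy; the function name `adaptive_select` and the target count `k` are hypothetical, and this stands in for the feedback loop only in the sense that it holds the number of selected locations roughly constant.

```python
import numpy as np

def adaptive_select(A, a, k):
    """Return (select vector, radius) with the smallest r selecting at least k rows."""
    n = A.shape[1]
    hamming = (n - A @ a) // 2                    # Hamming distance to every row of A
    r = int(np.partition(hamming, k - 1)[k - 1])  # k-th smallest distance
    return (hamming <= r).astype(np.int64), r

rng = np.random.default_rng(4)
n, m, k = 64, 500, 20                             # illustrative sizes (assumed)
A = rng.choice([-1, 1], size=(m, n))
a = rng.choice([-1, 1], size=n)

s, r = adaptive_select(A, a, k)
```

Ties at the threshold distance mean that slightly more than `k` locations may be selected; a stricter variant would break ties at random.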

Using the same distribution for the rows of $A$ as the distribution of the patterns in $K$, and
using (27) to specify the choice of $r$, all of the above analysis is applicable (assuming randomly
chosen output patterns). If the outputs do not have equal numbers of 1s and -1s, the mean of the noise
is not 0. However, if the distribution of outputs is also known, the system can still be made to work by
storing $1/p_+$ and $1/p_-$ for 1s and -1s respectively, where $p_\pm$ is the probability of getting a 1 or a -1
respectively. Using this storage algorithm, all of the above formulas hold (as long as the
distribution is smooth enough and not extremely dense). The SDM will be able to recover data stored
with correlated inputs with a fidelity given by Equation (17).

What if the distribution function $K$ is not known a priori? In that case, we would need to
have the matrix $A$ learn the distribution $\rho(x)$. There are many ways to build $A$ to mimic $\rho$. One
such way is to start with a random $A$ matrix and modify the entries of $\delta$ randomly chosen rows of
$A$ at each step according to the statistics of the most recent input patterns. Another method is to
use competitive learning [28-30] to achieve the proper distribution of $A$.


The competitive learning algorithm is a method for adjusting the weights $A_{ij}$ between the
first and second layer to match this probability density function, $\rho(x)$. The $i$th row of the address
matrix $A$ can be viewed as a vector $\mathbf{A}_i$. The competitive learning algorithm holds a competition
between these vectors, and a few vectors that are the closest (within the Hamming sphere $r$) to the
input pattern $\mathbf{x}$ are the winners. Each of these winners is then modified slightly in the direction
of $\mathbf{x}$. For large enough $m$, this algorithm almost always converges to a distribution of the $\mathbf{A}_i$ that
is the same as $\rho(x)$. The updating equation for the selected addresses is just

$$\mathbf{A}_i^{new} = \mathbf{A}_i^{old} - \lambda\,(\mathbf{A}_i^{old} - \mathbf{x}). \tag{28}$$

Note that for $\lambda = 1$, this reduces to the so-called unary representation of Baum et al., [31] which gives
the maximum efficiency in terms of capacity.
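The update rule (28) can be sketched as a single competitive-learning step. This is a minimal illustration assuming NumPy; the function name `competitive_update` is hypothetical, and letting the rows take real values after the update (with the distance generalized as (n - A_i · x)/2) is an assumption made here so that fractional values of λ are meaningful.

```python
import numpy as np

def competitive_update(A, x, r, lam):
    """One step of Eq. (28): move rows within distance r of x a fraction lam toward x."""
    n = A.shape[1]
    dist = (n - A @ x) / 2.0          # equals the Hamming distance for +/-1 rows
    winners = dist <= r
    A_new = A.astype(float)           # astype returns a copy; the input is left untouched
    A_new[winners] -= lam * (A_new[winners] - x)
    return A_new, winners

x = np.ones(8)
near = x.copy()
near[0] = -1                          # one bit away from x, so it wins for r = 2
A = np.array([near, -x])              # second row is as far from x as possible

A_full, winners = competitive_update(A, x, r=2, lam=1.0)   # winner jumps onto x
A_half, _ = competitive_update(A, x, r=2, lam=0.5)         # winner moves halfway
```

With `lam = 1.0` each winning row is replaced by the input itself, while smaller values nudge the winners toward the input and leave all other rows unchanged.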


DISCUSSION 


The above analysis said nothing about the basins of attraction of these memory states. A
measure of the performance of a content addressable memory should also say something about the
average radius of convergence of the basin of attraction. The basins are in general quite
complicated and have been investigated numerically for the unclipped models and values of $n$ and $m$
ranging in the 100s. [21] The basins of attraction for the SDM and the $d = 1$ model are very similar in
their characteristics and their average radius of convergence. However, the above results give an
upper bound on the capacity by looking at the fixed points of the system (if there is no fixed point,
there is no basin).

In summary, the above arguments show that the total information stored in outer-product
neural networks is a constant times the number of connections between the neurons. This constant
is independent of the order of the model and is the same ($\eta/R^2 b$) for the SDM as well as
higher-order Hopfield-type networks. The advantage of going to an architecture like the SDM is that the
number of patterns that can be stored in the network is independent of the size of the pattern,
whereas the number of stored patterns is limited to a fraction of the word size for the Willshaw or
Hopfield architecture. The point of the above analysis is that the efficiency of the SDM in terms
of information stored per bit is the same as for Hopfield-type models.

It was also demonstrated how sequences of patterns can be stored in the SDM, and how time 
delays can be used to recover contextual information. A minor modification of the SDM could be 
used to recover time sequences at slightly different rates of presentation. Moreover, another minor 
modification allows the storage of correlated patterns in the SDM. With these modifications, the 
SDM presents a versatile and efficient tool for investigating properties of associative memory. 